NSF PAR Search | NSF Public Access Repository


A Markov Property of Empirical Distributions and the Performance of Compression-Based Denoisers

https://doi.org/10.1109/ISIT63088.2025.11195706

Song, Dan; Özgür, Ayfer; Weissman, Tsachy (June 2025, 2025 IEEE International Symposium on Information Theory (ISIT))

Free, publicly-accessible full text available June 22, 2026
Adaptive Compression in Federated Learning via Side Information

Isik, Berivan; Pase, Francesco; Gunduz, Deniz; Koyejo, Sanmi; Weissman, Tsachy; Zorzi, Michele (May 2024, Proceedings of Machine Learning Research)

Full Text Available
Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach

https://doi.org/10.1038/s41598-023-29267-8

Meng, Qingxi; Chandak, Shubham; Zhu, Yifan; Weissman, Tsachy (December 2023, Scientific Reports)

Abstract: The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35–0.65 bits per base which is 3–6× lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (>4× faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.
Full Text Available
Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation

Isik, Berivan; Chen, Wei-Ning; Ozgur, Ayfer; Weissman, Tsachy; No, Albert (December 2023, NeurIPS)

Full Text Available
Neural Network Compression for Noisy Storage Devices

https://doi.org/10.1145/3588436

Isik, Berivan; Choi, Kristy; Zheng, Xin; Weissman, Tsachy; Ermon, Stefano; Wong, H.-S. Philip; Alaghi, Armin (May 2023, ACM Transactions on Embedded Computing Systems)

Compression and efficient storage of neural network (NN) parameters is critical for applications that run on resource-constrained devices. Despite the significant progress in NN model compression, there has been considerably less investigation in the actual physical storage of NN parameters. Conventionally, model compression and physical storage are decoupled, as digital storage media with error-correcting codes (ECCs) provide robust error-free storage. However, this decoupled approach is inefficient as it ignores the overparameterization present in most NNs and forces the memory device to allocate the same amount of resources to every bit of information regardless of its importance. In this work, we investigate analog memory devices as an alternative to digital media – one that naturally provides a way to add more protection for significant bits unlike its counterpart, but is noisy and may compromise the stored model’s performance if used naively. We develop a variety of robust coding strategies for NN weight storage on analog devices, and propose an approach to jointly optimize model compression and memory resource allocation. We then demonstrate the efficacy of our approach on models trained on MNIST, CIFAR-10, and ImageNet datasets for existing compression techniques. Compared to conventional error-free digital storage, our method reduces the memory footprint by up to one order of magnitude, without significantly compromising the stored model’s accuracy.
Full Text Available
Geometric Lower Bounds for Distributed Parameter Estimation Under Communication Constraints

https://doi.org/10.1109/TIT.2021.3108952

Han, Yanjun; Ozgur, Ayfer; Weissman, Tsachy (December 2021, IEEE Transactions on Information Theory)

Full Text Available
Impact of lossy compression of nanopore raw signal data on basecalling and consensus accuracy

https://doi.org/10.1093/bioinformatics/btaa1017

Chandak, Shubham; Tatwawadi, Kedar; Sridhar, Srivatsan; Weissman, Tsachy (December 2020, Bioinformatics)
Valencia, Alfonso (Ed.)
Abstract: Motivation: Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications. Results: We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35–50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications. Availability and implementation: The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation. Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
Optimal rates of entropy estimation over Lipschitz balls

https://doi.org/10.1214/19-AOS1927

Han, Yanjun; Jiao, Jiantao; Weissman, Tsachy; Wu, Yihong (December 2020, Annals of Statistics)
Full Text Available
MEOW: A Space-Efficient Nonparametric Bid Shading Algorithm

https://doi.org/10.1145/3447548.3467113

Zhang, Wei; Kitts, Brendan; Han, Yanjun; Zhou, Zhengyuan; Mao, Tingyu; He, Hao; Pan, Shengjun; Flores, Aaron; Gultekin, San; Weissman, Tsachy (August 2021, KDD 2021)

Full Text Available
Concentration inequalities for the empirical distribution of discrete distributions: beyond the method of types

https://doi.org/10.1093/imaiai/iaz025

Mardia, Jay; Jiao, Jiantao; Tánczos, Ervin; Nowak, Robert D; Weissman, Tsachy (November 2019, Information and Inference: A Journal of the IMA)

Abstract: We study concentration inequalities for the Kullback–Leibler (KL) divergence between the empirical distribution and the true distribution. Applying a recursion technique, we improve over the method of types bound uniformly in all regimes of sample size $n$ and alphabet size $k$, and the improvement becomes more significant when $k$ is large. We discuss the applications of our results in obtaining tighter concentration inequalities for $L_1$ deviations of the empirical distribution from the true distribution, and the difference between concentration around the expectation or zero. We also obtain asymptotically tight bounds on the variance of the KL divergence between the empirical and true distribution, and demonstrate their quantitatively different behaviours between small and large sample sizes compared to the alphabet size.
Full Text Available
